On the optimality of universal classifiers for finite-length individual test sequences
Abstract
It has been demonstrated that if two individual sequences are independent realizations of two finite-order, finite-alphabet, stationary Markov processes, an empirical divergence measure (ZMM) based on cross-parsing of one sequence relative to the other converges to the relative entropy almost surely. This yields an empirical, linear-complexity universal classifier that is asymptotically optimal in the sense that the probability of classification error vanishes as the length of the sequences tends to infinity, provided the KL-divergence between the two processes is positive. It is demonstrated that a version of the ZMM is not only asymptotically optimal as the length of the sequences tends to infinity, but is also essentially optimal for a class of finite-length sequences that are realizations of finite-alphabet, vanishing-memory processes with positive transitions, in the sense that the probability of classification error vanishes if the length of the sequences is larger than some positive integer N_0, and that it leads to an asymptotically optimal classification algorithm. At the same time, no universal classifier can yield an efficient discrimination between any two distinct processes in this class if the length N of the two sequences is such that log N < log N_0, even if the KL-divergence between the two processes is positive. It is further demonstrated that not every asymptotically optimal universal classification algorithm is also essentially optimal. A variable-length (VL) divergence, which converges to the KL-divergence as the length of the sequences tends to infinity, is defined.
Another universal classification algorithm, which like ZMM is based on cross-parsing, is shown to be optimal relative to the VL divergence (rather than just essentially optimal) for any two finite-length sequences that are realizations of vanishing-memory processes.
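The abstract does not spell out the cross-parsing step, so the following is only a minimal sketch of one common reading of it: the test sequence is split into phrases, each being the longest prefix of the unparsed remainder that occurs somewhere in the training sequence. The function names `cross_parse` and `classify`, the fewest-phrases decision rule, and the toy strings are illustrative assumptions, not the paper's algorithm, and the naive substring search used here is not linear-time.

```python
def cross_parse(z, x):
    """Parse z relative to x (assumed reading of cross-parsing).

    Each phrase is the longest prefix of the unparsed part of z that
    occurs as a substring of x. A symbol that never occurs in x forms
    a one-symbol phrase so that parsing always advances.
    """
    phrases = []
    i = 0
    while i < len(z):
        length = 1
        # Grow the candidate while it still occurs somewhere in x.
        while i + length <= len(z) and z[i:i + length] in x:
            length += 1
        # The loop stops one step past the longest match (or at 1 if no match).
        phrase_len = max(length - 1, 1)
        phrases.append(z[i:i + phrase_len])
        i += phrase_len
    return phrases


def classify(test, candidates):
    """Hypothetical decision rule: pick the training sequence that
    parses the test sequence into the fewest phrases."""
    counts = {label: len(cross_parse(test, train))
              for label, train in candidates.items()}
    return min(counts, key=counts.get), counts


if __name__ == "__main__":
    training = {"A": "ab" * 20, "B": "aabb" * 10}  # toy data, not from the paper
    label, counts = classify("abababab", training)
    print(label, counts)
```

Fewer phrases mean the test sequence can be covered by longer pieces of the training sequence, which is the intuition behind using the phrase count as a similarity signal. A divergence estimate in the sense of the abstract would further normalize this count, and that normalization is not reproduced here.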
Similar resources
Non-Abelian Sequenceable Groups Involving ?-Covers
A non-abelian finite group is called sequenceable if for some positive integer , is -generated ( ) and there exist integers such that every element of is a term of the -step generalized Fibonacci sequence , , , . A remarkable application of this definition may be found in the study of random covers in cryptography. The 2-step generalized sequences for the dihedral groups studi...
On Runs in Independent Sequences
Given an i.i.d. sequence of n letters from a finite alphabet, we consider the length of the longest run of any letter. In the equiprobable case, results for this run turn out to be closely related to the well-known results for the longest run of a given letter. For coin-tossing, tail probabilities are compared for both kinds of runs via Poisson approximation.
The Jacobsthal Sequences in Finite Groups
In this paper, we study the generalized order- Jacobsthal sequences modulo for and the generalized order-k Jacobsthal-Padovan sequence modulo for . Also, we define the generalized order-k Jacobsthal orbit of a k-generator group for and the generalized order-k Jacobsthal-Padovan orbit of a k-generator group for . Furthermore, we obtain the lengths of the periods of the generalized order-3 ...
A measure of relative entropy between individual sequences with application to universal classification
A new notion of empirical informational divergence (relative entropy) between two individual sequences is introduced. If the two sequences are independent realizations of two finite-order, finite-alphabet, stationary Markov processes, the empirical relative entropy converges to the relative entropy almost surely. This new empirical divergence is based on a version of the Lempel-Ziv data compress...
Malware Detection using Classification of Variable-Length Sequences
In this paper, a novel graph-based method is proposed for classifying variable-length sequences, with the graph serving as the feature extraction step. The proposed method overcomes the problems that traditional graphs have with variable-length data, without fixing the length of the sequences, by determining the most frequent instructions and placing the remaining instructions in an “other” set, which saves time and memory. Acco...
Journal: CoRR
Volume: abs/0909.4233
Publication date: 2009